  1. Dec 08, 2015
    • [SPARK-12069][SQL] Update documentation with Datasets · 39594894
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #10060 from marmbrus/docs.
    • [SPARK-12187] *MemoryPool classes should not be fully public · 94945216
      Andrew Or authored
      This patch tightens them to `private[memory]`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10182 from andrewor14/memory-visibility.
    • [SPARK-3873][BUILD] Add style checker to enforce import ordering. · 2ff17bcf
      Marcelo Vanzin authored
      The checker tries to follow as closely as possible the guidelines of
      the code style document, and makes some decisions where the guide is
      not clear. In particular:
      
      - wildcard imports come first when there are other imports in the
        same package
      - multi-import blocks come before single imports
      - lower-case names inside multi-import blocks come before others
      
      In some projects, such as graphx, there seems to be a convention to
      separate o.a.s imports from the project's own; to simplify the
      checker, I chose not to allow that, which is a strict interpretation
      of the code style guide, even though I think it makes sense.
      
      Since the checks are based on syntax only, some edge cases may
      generate spurious warnings; for example, when class names start
      with a lower case letter (and are thus treated as a package name
      by the checker).
      
      The checker is currently only generating warnings, and since there
      are many of those, the build output does get a little noisy. The
      idea is to fix the code (and the checker, as needed) little by little
      instead of having a huge change that touches everywhere.
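
The three ordering decisions listed in this commit message can be illustrated with a small sketch. This is not the checker added by the patch: the helper names `import_rank`, `order_package_imports` and `order_selectors` are hypothetical, the statements are assumed to belong to a single package, and ties are broken alphabetically as an assumption.

```python
def import_rank(statement):
    """Rank a Scala import statement: wildcard, then multi-import, then single."""
    target = statement.strip()
    if target.endswith("._"):
        return 0  # wildcard import, e.g. "import a.b._"
    if target.endswith("}"):
        return 1  # multi-import block, e.g. "import a.b.{C, D}"
    return 2      # single import, e.g. "import a.b.C"


def order_package_imports(statements):
    """Order imports of one package; alphabetical tie-breaking is an assumption."""
    return sorted(statements, key=lambda s: (import_rank(s), s))


def order_selectors(names):
    """Order names inside a multi-import block: lower-case names first."""
    return sorted(names, key=lambda n: (not n[:1].islower(), n))
```

For example, `order_selectors(["Zeta", "alpha", "Beta"])` places `alpha` ahead of the capitalized names. Like the checker described above, the sketch looks at syntax only, so a class whose name starts with a lower-case letter would be sorted as if it were a package.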
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6502 from vanzin/SPARK-3873.
    • [SPARK-12159][ML] Add user guide section for IndexToString transformer · 06746b30
      BenFradet authored
      Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #10166 from BenFradet/SPARK-12159.
    • [SPARK-11605][MLLIB] ML 1.6 QA: API: Java compatibility, docs · 5cb46950
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-11605
      Check Java compatibility for MLlib for this release.
      
      fix:
      
      1. `StreamingTest.registerStream` needs a Java-friendly interface.
      
      2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` have Java compatibility issues. Mark them as `developerAPI`.
      
      TBD:
      [updated] no fix for now per discussion.
      `org.apache.spark.mllib.classification.LogisticRegressionModel`
      `public scala.Option<java.lang.Object> getThreshold();` has wrong return type for Java invocation.
      `SVMModel` has a similar issue.
      
      Yet adding a `scala.Option<java.lang.Double> getThreshold()` would result in an overloading error because it would have the same function signature, and adding a new function with a different name does not seem necessary.
      
      cc jkbradley feynmanliang
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #10102 from hhbyyh/javaAPI.
    • [SPARK-12205][SQL] Pivot fails Analysis when aggregate is UnresolvedFunction · 4bcb8949
      Andrew Ray authored
      Delays application of ResolvePivot until all aggregates are resolved to prevent problems with UnresolvedFunction, and adds a unit test.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #10202 from aray/sql-pivot-unresolved-function.
    • [SPARK-10393] use ML pipeline in LDA example · 872a2ee2
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-10393
      
      Since the logic of the text processing part has been moved to ML estimators/transformers, replace the related code in LDA Example with the ML pipeline.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      Author: yuhaoyang <yuhao@zhanglipings-iMac.local>
      
      Closes #8551 from hhbyyh/ldaExUpdate.
    • [SPARK-12188][SQL] Code refactoring and comment correction in Dataset APIs · 5d96a710
      gatorsmile authored
      This PR contains the following updates:
      
      - Created a new private variable `boundTEncoder` that can be shared by multiple functions: `RDD`, `select` and `collect`.
      - Replaced all uses of `queryExecution.analyzed` with the function call `logicalPlan`.
      - Fixed a few API comments that used wrong class names (e.g., `DataFrame`) or parameter names (e.g., `n`).
      - Fixed a few API descriptions that were wrong (e.g., `mapPartitions`).
      
      marmbrus rxin cloud-fan Could you take a look and check if they are appropriate? Thank you!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10184 from gatorsmile/datasetClean.
    • [SPARK-12195][SQL] Adding BigDecimal, Date and Timestamp into Encoder · c0b13d55
      gatorsmile authored
      This PR adds three more data types to Encoder: `BigDecimal`, `Date` and `Timestamp`.
      
      marmbrus cloud-fan rxin Could you take a quick look at these three types? Not sure if it can be merged to 1.6. Thank you very much!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10188 from gatorsmile/dataTypesinEncoder.
    • [SPARK-12201][SQL] add type coercion rule for greatest/least · 381f17b5
      Wenchen Fan authored
      Checked with Hive: greatest/least should cast their children to the tightest common type,
      i.e. `(int, long) => long`, `(int, string) => error`, `(decimal(10,5), decimal(5, 10)) => error`
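
The three cases above can be reproduced with a minimal sketch of a "tightest common type" lookup. This is an illustration, not Spark's analyzer rule: types are plain strings, the names `NUMERIC_PRECEDENCE`, `tightest_common_type` and `coerce_children` are hypothetical, and the numeric widening order is an assumption.

```python
# Assumed widening order for numeric types, narrowest first.
NUMERIC_PRECEDENCE = ["byte", "short", "int", "long", "float", "double"]


def tightest_common_type(left, right):
    """Return the narrowest type both sides widen to, or None if there is none."""
    if left == right:
        return left
    if left in NUMERIC_PRECEDENCE and right in NUMERIC_PRECEDENCE:
        return max(left, right, key=NUMERIC_PRECEDENCE.index)
    # Strings and differently shaped decimals have no common type in this sketch.
    return None


def coerce_children(types):
    """Fold the children of greatest/least into one type, or raise on a mismatch."""
    result = types[0]
    for current in types[1:]:
        common = tightest_common_type(result, current)
        if common is None:
            raise TypeError("no tightest common type for %s and %s" % (result, current))
        result = common
    return result
```

With these assumptions, `coerce_children(["int", "long"])` yields `"long"`, while `["int", "string"]` and `["decimal(10,5)", "decimal(5,10)"]` both raise `TypeError`.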
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10196 from cloud-fan/type-coercion.
    • [SPARK-12074] Avoid memory copy involving ByteBuffer.wrap(ByteArrayOutputStream.toByteArray) · 75c60bf4
      tedyu authored
      SPARK-12060 fixed JavaSerializerInstance.serialize.
      This PR applies the same technique to two other classes.
      
      zsxwing
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #10177 from tedyu/master.
    • [SPARK-11155][WEB UI] Stage summary json should include stage duration · 6cb06e87
      Xin Ren authored
      The JSON endpoint for stages doesn't include the stage duration information that is present in the UI. This looks like a simple oversight; the metrics should be included, e.g., at api/v1/applications/<appId>/stages.
      
      Metrics I've added are: submissionTime, firstTaskLaunchedTime and completionTime.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #10107 from keypointt/SPARK-11155.
    • [SPARK-11652][CORE] Remote code execution with InvokerTransformer · e3735ce1
      Sean Owen authored
      Fix the commons-collection group ID to commons-collections for version 3.x.
      
      Patches earlier PR at https://github.com/apache/spark/pull/9731
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #10198 from srowen/SPARK-11652.2.
    • [SPARK-11551][DOC][EXAMPLE] Revert PR #10002 · da2012a0
      Cheng Lian authored
      This reverts PR #10002, commit 78209b0c.
      
      The original PR wasn't tested on Jenkins before being merged.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10200 from liancheng/revert-pr-10002.
    • [SPARK-11439][ML] Optimization of creating sparse feature without dense one · 037b7e76
      Nakul Jindal authored
      Sparse features generated in LinearDataGenerator no longer create dense vectors as an intermediate step.
      
      Author: Nakul Jindal <njindal@us.ibm.com>
      
      Closes #9756 from nakul02/SPARK-11439_sparse_without_creating_dense_feature.
    • [SPARK-12166][TEST] Unset hadoop related environment in testing · 70812918
      Jeff Zhang authored
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #10172 from zjffdu/SPARK-12166.
    • [SPARK-12103][STREAMING][KAFKA][DOC] document that K means Key and V means Value · 48a9804b
      cody koeninger authored
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #10132 from koeninger/SPARK-12103.
    • [SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example code · 4a39b5a1
      Yanbo Liang authored
      Add `SQLTransformer` user guide and example code, and make the Scala API doc clearer.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10006 from yanboliang/spark-11958.
    • [SPARK-10259][ML] Add @since annotation to ml.classification · 7d05a624
      Takahashi Hiroshi authored
      Add since annotation to ml.classification
      
      Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp>
      
      Closes #8534 from taishi-oss/issue10259.
    • Closes #10098 · 73896588
      Xiangrui Meng authored
    • [SPARK-11551][DOC][EXAMPLE] Replace example code in ml-features.md using include_example · 78209b0c
      somideshmukh authored
      Made a new patch containing only the markdown examples moved to the example folder.
      Only three Java examples were not shifted since they contained compilation errors. These classes are:
      1) StandardScale 2) NormalizerExample 3) VectorIndexer
      
      Author: Xusen Yin <yinxusen@gmail.com>
      Author: somideshmukh <somilde@us.ibm.com>
      
      Closes #10002 from somideshmukh/SomilBranch1.33.
  2. Dec 07, 2015
    • [SPARK-12160][MLLIB] Use SQLContext.getOrCreate in MLlib · 3e7e05f5
      Joseph K. Bradley authored
      Switched from using SQLContext constructor to using getOrCreate, mainly in model save/load methods.
      
      This covers all instances in spark.mllib.  There were no uses of the constructor in spark.ml.
      
      CC: mengxr yhuai
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #10161 from jkbradley/mllib-sqlcontext-fix.
    • [SPARK-12184][PYTHON] Make python api doc for pivot consistant with scala doc · 36282f78
      Andrew Ray authored
      In SPARK-11946 the API for pivot was changed a bit and its doc was updated, but the doc changes were not made for the Python API. This PR updates the Python doc to be consistent.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #10176 from aray/sql-pivot-python-doc.
    • [SPARK-11884] Drop multiple columns in the DataFrame API · 84b80944
      tedyu authored
      See the thread Ben started:
      http://search-hadoop.com/m/q3RTtveEuhjsr7g/
      
      This PR adds a drop() method to DataFrame that accepts multiple column names.
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #9862 from ted-yu/master.
    • [SPARK-11963][DOC] Add docs for QuantileDiscretizer · 871e85d9
      Xusen Yin authored
      https://issues.apache.org/jira/browse/SPARK-11963
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #9962 from yinxusen/SPARK-11963.
    • [SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize · 3f4efb5c
      Shixiong Zhu authored
      Merged #10051 again since #10083 is resolved.
      
      This reverts commit 328b757d.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10167 from zsxwing/merge-SPARK-12060.
    • [SPARK-11932][STREAMING] Partition previous TrackStateRDD if partitioner not present · 5d80d8c6
      Tathagata Das authored
      The reason is that TrackStateRDDs generated by trackStateByKey expect the previous batch's TrackStateRDDs to have a partitioner. However, when recovering from DStream checkpoints, the RDDs recovered from RDD checkpoints do not have a partitioner attached to them. This is because RDD checkpoints do not preserve the partitioner (SPARK-12004).
      
      While #9983 solves SPARK-12004 by preserving the partitioner through RDD checkpoints, there may be a non-zero chance that the saving and recovery fail. To be resilient, this PR repartitions the previous state RDD if the partitioner is not detected.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #9988 from tdas/SPARK-11932.
    • [SPARK-12132] [PYSPARK] raise KeyboardInterrupt inside SIGINT handler · ef3f047c
      Davies Liu authored
      Currently, the current line is not cleared by Ctrl-C.
      
      After this patch:
      ```
      >>> asdfasdf^C
      Traceback (most recent call last):
        File "~/spark/python/pyspark/context.py", line 225, in signal_handler
          raise KeyboardInterrupt()
      KeyboardInterrupt
      ```
      
      It's still worse than 1.5 (and before).
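
The behaviour in the traceback above can be sketched with the standard `signal` module. This is a minimal illustration and not PySpark's actual handler: only the name `signal_handler` comes from the traceback, while `interrupt_self` is a hypothetical helper that sends SIGINT to the current process so the effect can be observed. It assumes Python 3.8+ and that it runs in the main thread.

```python
import signal
import time


def signal_handler(signum, frame):
    # Raising here surfaces the interrupt in the main thread, so the
    # interpreter abandons the line currently being typed or executed.
    raise KeyboardInterrupt()


def interrupt_self():
    """Install the handler, send SIGINT to this process, and report the result."""
    previous = signal.signal(signal.SIGINT, signal_handler)
    try:
        signal.raise_signal(signal.SIGINT)  # available since Python 3.8
        time.sleep(1)  # fallback: gives the interpreter time to run the handler
    except KeyboardInterrupt:
        return True
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGINT, previous)
    return False
```

Calling `interrupt_self()` should return `True` when the handler fires. Restoring the previous handler in `finally` keeps the sketch from changing Ctrl-C behaviour for the rest of the session.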
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10134 from davies/fix_cltrc.
    • [SPARK-12034][SPARKR] Eliminate warnings in SparkR test cases. · 39d677c8
      Sun Rui authored
      This PR:
      1. Suppresses all known warnings.
      2. Cleans up test cases and fixes some errors in them.
      3. Fixes errors in HiveContext-related test cases. These test cases were not actually run previously due to a bug in creating TestHiveContext.
      4. Supports 'testthat' package version 0.11.0, which prefers that test cases be under 'tests/testthat'.
      5. Makes sure the default Hadoop file system is local when running test cases.
      6. Turns warnings into errors.
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #10030 from sun-rui/SPARK-12034.
    • [SPARK-12032] [SQL] Re-order inner joins to do join with conditions first · 9cde7d5f
      Davies Liu authored
      Currently, the order of joins is exactly the same as in the SQL query. Some conditions may not be pushed down to the correct join, and those joins then become cross products, which are extremely slow.
      
      This patch tries to re-order the inner joins (which are common in SQL queries): it picks the joins that have self-contained conditions first and delays those that do not have conditions.
      
      After this patch, the TPCDS queries Q64/65 can run hundreds of times faster.
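
The idea of picking joins with usable conditions first can be sketched as a greedy pass over the tables. This is an illustration of the idea only, not the optimizer rule from the patch: `reorder_joins` is a hypothetical helper, tables are plain names, and each condition is modelled as the set of table names it references.

```python
def reorder_joins(tables, conditions):
    """Greedily order inner joins so each step has a usable condition when possible.

    tables: table names in the order written in the query.
    conditions: iterable of sets, each holding the tables one predicate references.
    Returns a list of (table, conditions_applied_at_this_join) pairs.
    """
    remaining = list(tables)
    joined = [remaining.pop(0)]
    pending = [frozenset(c) for c in conditions]
    plan = [(joined[0], [])]
    while remaining:
        pick = None
        for candidate in remaining:
            available = set(joined) | {candidate}
            # A condition is usable once every table it mentions is available.
            usable = [c for c in pending if candidate in c and c <= available]
            if usable:
                pick = (candidate, usable)
                break
        if pick is None:
            # No condition applies yet: this join is a cross product, taken last.
            pick = (remaining[0], [])
        table, usable = pick
        remaining.remove(table)
        joined.append(table)
        pending = [c for c in pending if c not in usable]
        plan.append((table, usable))
    return plan
```

For a query written as `a, b, c` with conditions on `(a, c)` and `(b, c)`, joining in query order would start with a cross product of `a` and `b`. The sketch instead joins `c` to `a` first and then `b`, so every join has a condition.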
      
      cc marmbrus nongli
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10073 from davies/reorder_joins.
    • [SPARK-12106][STREAMING][FLAKY-TEST] BatchedWAL test transiently flaky when Jenkins load is high · 6fd9e70e
      Burak Yavuz authored
      We need to make sure that the last entry is indeed the last entry in the queue.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #10110 from brkyvz/batch-wal-test-fix.
  3. Dec 06, 2015
    • [SPARK-12152][PROJECT-INFRA] Speed up Scalastyle checks by only invoking SBT once · 80a824d3
      Josh Rosen authored
      Currently, `dev/scalastyle` invokes SBT four times, but these invocations can be replaced with a single invocation, saving about one minute of build time.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10151 from JoshRosen/speed-up-scalastyle.
    • [SPARK-12138][SQL] Escape \u in the generated comments of codegen · 49efd03b
      gatorsmile authored
      When \u appears in a comment block (i.e. in /**/), code gen will break. So, in Expression and CodegenFallback, we escape \u to \\u.
      
      yhuai Please review it. I did reproduce it and it works after the fix. Thanks!
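
Some background on why the escape is needed: the Java compiler translates Unicode escapes before it recognizes comments, so a stray `\u` inside `/* */` that is not followed by four hex digits is a compile error. A backslash that is itself preceded by an odd number of backslashes does not start an escape, which is why `\\u` is safe. The replacement described above can be sketched as follows; `escape_unicode_marker` is a hypothetical name, and the sketch does not handle input that already contains runs of backslashes before `u`.

```python
def escape_unicode_marker(comment):
    """Double the backslash in every \\u so Java will not parse it as an escape."""
    return comment.replace("\\u", "\\\\u")
```

For example, a comment containing `\udf` (backslash, `u`, two letters) becomes `\\udf`, which the Java lexer leaves alone.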
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10155 from gatorsmile/escapeU.
    • [SPARK-12048][SQL] Prevent to close JDBC resources twice · 04b67999
      gcc authored
      Author: gcc <spark-src@condor.rhaag.ip>
      
      Closes #10101 from rh99/master.
    • [SPARK-12044][SPARKR] Fix usage of isnan, isNaN · b6e8e63a
      Yanbo Liang authored
      1. Add `isNaN` to `Column` for SparkR. `Column` should have three related functions: `isNaN, isNull, isNotNull`.
      2. Replace `DataFrame.isNaN` with `DataFrame.isnan` on the SparkR side, because `DataFrame.isNaN` has been deprecated and will be removed in Spark 2.0.
      <del>3. Add `isnull` to `DataFrame` for SparkR. `DataFrame` should have two related functions: `isnan, isnull`.</del>
      
      cc shivaram sun-rui felixcheung
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10037 from yanboliang/spark-12044.
  4. Dec 05, 2015