  1. Jul 30, 2015
    • [SPARK-9390][SQL] create a wrapper for array type · c0cc0eae
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7724 from cloud-fan/array-data and squashes the following commits:
      
      d0408a1 [Wenchen Fan] fix python
      661e608 [Wenchen Fan] rebase
      f39256c [Wenchen Fan] fix hive...
      6dbfa6f [Wenchen Fan] fix hive again...
      8cb8842 [Wenchen Fan] remove element type parameter from getArray
      43e9816 [Wenchen Fan] fix mllib
      e719afc [Wenchen Fan] fix hive
      4346290 [Wenchen Fan] address comment
      d4a38da [Wenchen Fan] remove sizeInBytes and add license
      7e283e2 [Wenchen Fan] create a wrapper for array type
      c0cc0eae
    • [SPARK-9248] [SPARKR] Closing curly-braces should always be on their own line · 7492a33f
      Yuu ISHIKAWA authored
      ### JIRA
      [[SPARK-9248] Closing curly-braces should always be on their own line - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9248)
      
      ## The result of `dev/lint-r`
      [The result of `dev/lint-r` for SPARK-9248 at revision 6175d6cf](https://gist.github.com/yu-iskw/96cadcea4ce664c41f81)
      
      Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #7795 from yu-iskw/SPARK-9248 and squashes the following commits:
      
      c8eccd3 [Yuu ISHIKAWA] [SPARK-9248][SparkR] Closing curly-braces should always be on their own line
      7492a33f
    • [MINOR] [MLLIB] fix doc for RegexTokenizer · 81464f2a
      Xiangrui Meng authored
      This is #7791 for Python. hhbyyh
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7798 from mengxr/regex-tok-py and squashes the following commits:
      
      baa2dcd [Xiangrui Meng] fix doc for RegexTokenizer
      81464f2a
    • [SPARK-9277] [MLLIB] SparseVector constructor must throw an error when... · ed3cb1d2
      Sean Owen authored
      [SPARK-9277] [MLLIB] SparseVector constructor must throw an error when declared number of elements less than array length
      
      Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.
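
A minimal sketch of the kind of constructor check described above, written in Python for illustration only. The class `SimpleSparseVector`, the helper `try_build`, and the error messages are hypothetical and are not the MLlib API.

```python
class SimpleSparseVector:
    """Hypothetical stand-in for a sparse vector; not the MLlib class."""

    def __init__(self, size, indices, values):
        # Indices and values must pair up one-to-one.
        if len(indices) != len(values):
            raise ValueError(
                f"indices ({len(indices)}) and values ({len(values)}) differ in length")
        # The declared size must be able to hold every provided entry.
        if size < len(indices):
            raise ValueError(
                f"declared size {size} is smaller than the {len(indices)} entries provided")
        self.size = size
        self.indices = list(indices)
        self.values = list(values)


def try_build(size, indices, values):
    """Return True if construction succeeds, False if it is rejected."""
    try:
        SimpleSparseVector(size, indices, values)
        return True
    except ValueError:
        return False
```

With this sketch, `try_build(3, [0, 2], [1.0, 2.0])` succeeds, while a declared size of 1 for the same two entries is rejected.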
      
      CC MechCoder jkbradley -- I am not sure if a change needs to also happen in the Python API? I didn't see it had any similar checks to begin with, but I don't know it well.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7794 from srowen/SPARK-9277 and squashes the following commits:
      
      e8dc31e [Sean Owen] Fix scalastyle
      6ffe34a [Sean Owen] Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.
      ed3cb1d2
    • [SPARK-9225] [MLLIB] LDASuite needs unit tests for empty documents · a6e53a9c
      Meihua Wu authored
      Add unit tests for running LDA with empty documents.
      Both EMLDAOptimizer and OnlineLDAOptimizer are tested.
      
      feynmanliang
      
      Author: Meihua Wu <meihuawu@umich.edu>
      
      Closes #7620 from rotationsymmetry/SPARK-9225 and squashes the following commits:
      
      3ed7c88 [Meihua Wu] Incorporate reviewer's further comments
      f9432e8 [Meihua Wu] Incorporate reviewer's comments
      8e1b9ec [Meihua Wu] Merge remote-tracking branch 'upstream/master' into SPARK-9225
      ad55665 [Meihua Wu] Add unit tests for running LDA with empty documents
      a6e53a9c
    • [SPARK-] [MLLIB] minor fix on tokenizer doc · 9c0501c5
      Yuhao Yang authored
      A trivial fix for the comments of RegexTokenizer.
      
      Maybe this is too small, yet I just noticed it and think it can be quite misleading. I can create a jira if necessary.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #7791 from hhbyyh/docFix and squashes the following commits:
      
      cdf2542 [Yuhao Yang] minor fix on tokenizer doc
      9c0501c5
    • [SPARK-8998] [MLLIB] Distribute PrefixSpan computation for large projected databases · d212a314
      zhangjiajin authored
      Continuation of work by zhangjiajin
      
      Closes #7412
      
      Author: zhangjiajin <zhangjiajin@huawei.com>
      Author: Feynman Liang <fliang@databricks.com>
      Author: zhang jiajin <zhangjiajin@huawei.com>
      
      Closes #7783 from feynmanliang/SPARK-8998-improve-distributed and squashes the following commits:
      
      a61943d [Feynman Liang] Collect small patterns to local
      4ddf479 [Feynman Liang] Parallelize freqItemCounts
      ad23aa9 [zhang jiajin] Merge pull request #1 from feynmanliang/SPARK-8998-collectBeforeLocal
      87fa021 [Feynman Liang] Improve extend prefix readability
      c2caa5c [Feynman Liang] Readability improvements and comments
      1235cfc [Feynman Liang] Use Iterable[Array[_]] over Array[Array[_]] for database
      da0091b [Feynman Liang] Use lists for prefixes to reuse data
      cb2a4fc [Feynman Liang] Inline code for readability
      01c9ae9 [Feynman Liang] Add getters
      6e149fa [Feynman Liang] Fix splitPrefixSuffixPairs
      64271b3 [zhangjiajin] Modified codes according to comments.
      d2250b7 [zhangjiajin] remove minPatternsBeforeLocalProcessing, add maxSuffixesBeforeLocalProcessing.
      b07e20c [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into CollectEnoughPrefixes
      095aa3a [zhangjiajin] Modified the code according to the review comments.
      baa2885 [zhangjiajin] Modified the code according to the review comments.
      6560c69 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixeSpan
      a8fde87 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark
      4dd1c8a [zhangjiajin] initialize file before rebase.
      078d410 [zhangjiajin] fix a scala style error.
      22b0ef4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan.
      ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
      574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
      ba5df34 [zhangjiajin] Fix a Scala style error.
      4c60fb3 [zhangjiajin] Fix some Scala style errors.
      1dd33ad [zhangjiajin] Modified the code according to the review comments.
      89bc368 [zhangjiajin] Fixed a Scala style error.
      a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
      951fd42 [zhang jiajin] Delete Prefixspan.scala
      575995f [zhangjiajin] Modified the code according to the review comments.
      91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
      d212a314
    • [SPARK-5561] [MLLIB] Generalized PeriodicCheckpointer for RDDs and Graphs · c5815930
      Joseph K. Bradley authored
      PeriodicGraphCheckpointer was introduced for Latent Dirichlet Allocation (LDA), but it was meant to be generalized to work with Graphs, RDDs, and other data structures based on RDDs.  This PR generalizes it.
      
      For those who are not familiar with the periodic checkpointer, it tries to automatically handle persisting/unpersisting and checkpointing/removing checkpoint files in a lineage of RDD-based objects.
      
      I need it generalized to use with GradientBoostedTrees [https://issues.apache.org/jira/browse/SPARK-6684].  It should be useful for other iterative algorithms as well.
      
      Changes I made:
      * Copied PeriodicGraphCheckpointer to PeriodicCheckpointer.
      * Within PeriodicCheckpointer, I created abstract methods for the basic operations (checkpoint, persist, etc.).
      * The subclasses for Graphs and RDDs implement those abstract methods.
      * I copied the test suite for the graph checkpointer and made tiny modifications to make it work for RDDs.
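
The structure listed above (a base class with abstract operations, plus subclasses that implement them) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Scala code in the PR: the class names, the `max_persisted` retention window, and the logging subclass are hypothetical, and removal of old checkpoint files is left out.

```python
from abc import ABC, abstractmethod
from collections import deque


class PeriodicCheckpointerSketch(ABC):
    """Hypothetical base class: subclasses supply the data-specific operations."""

    def __init__(self, checkpoint_interval, max_persisted=3):
        self.checkpoint_interval = checkpoint_interval
        self.max_persisted = max_persisted  # assumed retention window
        self.update_count = 0
        self.persisted = deque()

    @abstractmethod
    def persist(self, data): ...

    @abstractmethod
    def unpersist(self, data): ...

    @abstractmethod
    def checkpoint(self, data): ...

    def update(self, data):
        """Register the newest object in the lineage."""
        self.persist(data)
        self.persisted.append(data)
        self.update_count += 1
        # Checkpoint on every checkpoint_interval-th update.
        if self.update_count % self.checkpoint_interval == 0:
            self.checkpoint(data)
        # Release the oldest objects once the window is exceeded.
        while len(self.persisted) > self.max_persisted:
            self.unpersist(self.persisted.popleft())


class RecordingCheckpointer(PeriodicCheckpointerSketch):
    """Toy subclass that only logs which operation was applied to which item."""

    def __init__(self, checkpoint_interval, max_persisted=3):
        super().__init__(checkpoint_interval, max_persisted)
        self.log = []

    def persist(self, data):
        self.log.append(("persist", data))

    def unpersist(self, data):
        self.log.append(("unpersist", data))

    def checkpoint(self, data):
        self.log.append(("checkpoint", data))


def run_demo():
    cp = RecordingCheckpointer(checkpoint_interval=2, max_persisted=2)
    for item in ["rdd1", "rdd2", "rdd3"]:
        cp.update(item)
    return cp.log
```

A subclass for a real data structure would replace the logging calls with that structure's own persist, unpersist, and checkpoint operations.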
      
      To review this PR, I recommend doing 2 diffs:
      (1) diff between the old PeriodicGraphCheckpointer.scala and the new PeriodicCheckpointer.scala
      (2) diff between the 2 test suites
      
      CCing andrewor14 in case there are relevant changes to checkpointing.
      CCing feynmanliang in case you're interested in learning about checkpointing.
      CCing mengxr for final OK.
      Thanks all!
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7728 from jkbradley/gbt-checkpoint and squashes the following commits:
      
      d41902c [Joseph K. Bradley] Oops, forgot to update an extra time in the checkpointer tests, after the last commit. I'll fix that. I'll also make some of the checkpointer methods protected, which I should have done before.
      32b23b8 [Joseph K. Bradley] fixed usage of checkpointer in lda
      0b3dbc0 [Joseph K. Bradley] Changed checkpointer constructor not to take initial data.
      568918c [Joseph K. Bradley] Generalized PeriodicGraphCheckpointer to PeriodicCheckpointer, with subclasses for RDDs and Graphs.
      c5815930
    • [SPARK-7368] [MLLIB] Add QR decomposition for RowMatrix · d31c618e
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-7368
      Add QR decomposition for RowMatrix.
      
      I'm not sure what the community's blueprint for distributed matrices is, or whether this would be a desirable feature, so I am sending a prototype for discussion. If it is acceptable, I'll go on to polish the code and provide unit tests and performance statistics.
      
      The implementation follows the paper by Austin R. Benson, David F. Gleich, and James Demmel, "Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures", 2013 IEEE International Conference on Big Data (https://www.cs.purdue.edu/homes/dgleich/publications/Benson%202013%20-%20direct-tsqr.pdf). It is a stable algorithm with good scalability.
      
      So far I have tried it on a 400000 x 500 RowMatrix (16 partitions), where it brings the computation time down from 8.8 minutes (using breeze.linalg.qr.reduced) to 2.6 minutes on a 4-worker cluster. I think there is still room for performance improvement.
      
      Trials and suggestions are welcome.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #5909 from hhbyyh/qrDecomposition and squashes the following commits:
      
      cec797b [Yuhao Yang] remove unnecessary qr
      0fb1012 [Yuhao Yang] hierarchy R computing
      3fbdb61 [Yuhao Yang] update qr to indirect and add ut
      0d913d3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition
      39213c3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition
      c0fc0c7 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition
      39b0b22 [Yuhao Yang] initial draft for discussion
      d31c618e
    • [SPARK-8838] [SQL] Add config to enable/disable merging part-files when merging parquet schema · 6175d6cf
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8838
      
      Currently all part-files are merged when merging the Parquet schema. However, when there are many part-files and we can be sure that all of them have the same schema as their summary file, merging them is unnecessary. For that case, we provide a configuration to disable merging part-files when merging the Parquet schema.
      
      In short, we need to merge Parquet schemas because different summary files may contain different schemas, but the part-files are known to have the same schema as their summary files.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7238 from viirya/option_partfile_merge and squashes the following commits:
      
      71d5b5f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
      8816f44 [Liang-Chi Hsieh] For comments.
      dbc8e6b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
      afc2fa1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
      d4ed7e6 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
      df43027 [Liang-Chi Hsieh] Get dataStatuses' partitions based on all paths.
      4eb2f00 [Liang-Chi Hsieh] Use given parameter.
      ea8f6e5 [Liang-Chi Hsieh] Correct the code comments.
      a57be0e [Liang-Chi Hsieh] Merge part-files if there are no summary files.
      47df981 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
      4caf293 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
      0e734e0 [Liang-Chi Hsieh] Use correct API.
      3b6be5b [Liang-Chi Hsieh] Fix key not found.
      4bdd7e0 [Liang-Chi Hsieh] Don't read footer files if we can skip them.
      8bbebcb [Liang-Chi Hsieh] Figure out how to test the config.
      bbd4ce7 [Liang-Chi Hsieh] Add config to enable/disable merging part-files when merging parquet schema.
      6175d6cf
    • Fix flaky HashedRelationSuite · 5ba2d440
      Reynold Xin authored
      SparkEnv might not have been set in local unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7784 from rxin/HashedRelationSuite and squashes the following commits:
      
      435d64b [Reynold Xin] Fix flaky HashedRelationSuite
      5ba2d440
    • Revert "[SPARK-9458] Avoid object allocation in prefix generation." · 4a8bb9d0
      Reynold Xin authored
      This reverts commit 9514d874.
      4a8bb9d0
    • [SPARK-9335] [TESTS] Enable Kinesis tests only when files in extras/kinesis-asl are changed · 76f2e393
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7711 from zsxwing/SPARK-9335-test and squashes the following commits:
      
      c13ec2f [zsxwing] environs -> environ
      69c2865 [zsxwing] Merge remote-tracking branch 'origin/master' into SPARK-9335-test
      ef84a08 [zsxwing] Revert "Modify the Kinesis project to trigger ENABLE_KINESIS_TESTS"
      f691028 [zsxwing] Modify the Kinesis project to trigger ENABLE_KINESIS_TESTS
      7618205 [zsxwing] Enable Kinesis tests only when files in extras/kinesis-asl are changed
      76f2e393
    • [SPARK-8005][SQL] Input file name · 1221849f
      Joseph Batchik authored
      Users can now get the file name of the partition being read in. A thread local variable is in `SQLNewHadoopRDD` and is set when the partition is computed. `SQLNewHadoopRDD` is moved to core so that the catalyst package can reach it.
      
      This supports:
      
      `df.select(inputFileName())`
      
      and
      
      `sqlContext.sql("select input_file_name() from table")`
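
The thread-local mechanism described above can be illustrated with a small Python sketch. It is not the `SQLNewHadoopRDD` code: the helper names (`set_input_file_name`, `compute_partition`, `run_in_thread`) are hypothetical, and the empty-string default mirrors the default mentioned in the squashed commits.

```python
import threading

# Thread-local holder; each task thread sees only its own value.
_input_file = threading.local()


def set_input_file_name(name):
    _input_file.name = name


def input_file_name():
    # Default to an empty string when no file is being read on this thread.
    return getattr(_input_file, "name", "")


def compute_partition(path, lines):
    """Hypothetical partition reader: records the file name, then tags each row."""
    set_input_file_name(path)
    return [(input_file_name(), line) for line in lines]


def run_in_thread(path, lines):
    """Compute one partition on a separate thread and return its tagged rows."""
    result = []
    worker = threading.Thread(
        target=lambda: result.extend(compute_partition(path, lines)))
    worker.start()
    worker.join()
    return result
```

Because the value is stored per thread, a thread that never computed a partition still sees the empty default.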
      
      Author: Joseph Batchik <josephbatchik@gmail.com>
      
      Closes #7743 from JDrit/input_file_name and squashes the following commits:
      
      abb8609 [Joseph Batchik] fixed failing test and changed the default value to be an empty string
      d2f323d [Joseph Batchik] updates per review
      102061f [Joseph Batchik] updates per review
      75313f5 [Joseph Batchik] small fixes
      c7f7b5a [Joseph Batchik] addeding input file name to Spark SQL
      1221849f
    • [SPARK-9428] [SQL] Add test cases for null inputs for expression unit tests · e127ec34
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9428
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7748 from yjshen/string_cleanup and squashes the following commits:
      
      e0c2b3d [Yijie Shen] update codegen in RegExpExtract and RegExpReplace
      26614d2 [Yijie Shen] MathFunctionSuite
      a402859 [Yijie Shen] complex_create, conditional and cast
      6e4e608 [Yijie Shen] arithmetic and cast
      52593c1 [Yijie Shen] null input test cases for StringExpressionSuite
      e127ec34
    • HOTFIX: disable HashedRelationSuite. · 712465b6
      Reynold Xin authored
      712465b6
    • [SPARK-9116] [SQL] [PYSPARK] support Python only UDT in __main__ · e044705b
      Davies Liu authored
      This also makes it possible to create a Python UDT without a Scala counterpart, which is important for Python users.
      
      cc mengxr JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7453 from davies/class_in_main and squashes the following commits:
      
      4dfd5e1 [Davies Liu] add tests for Python and Scala UDT
      793d9b2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      dc65f19 [Davies Liu] address comment
      a9a3c40 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      a86e1fc [Davies Liu] fix serialization
      ad528ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      63f52ef [Davies Liu] fix pylint check
      655b8a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      316a394 [Davies Liu] support Python UDT with UTF
      0bcb3ef [Davies Liu] fix bug in mllib
      de986d6 [Davies Liu] fix test
      83d65ac [Davies Liu] fix bug in StructType
      55bb86e [Davies Liu] support Pyt...
      e044705b
    • Fix reference to self.names in StructType · f5dd1133
      Alex Angelini authored
      `names` is not defined in this context; I think you meant `self.names`.
      
      davies
      
      Author: Alex Angelini <alex.louis.angelini@gmail.com>
      
      Closes #7766 from angelini/fix_struct_type_names and squashes the following commits:
      
      01543a1 [Alex Angelini] Fix reference to self.names in StructType
      f5dd1133
  2. Jul 29, 2015
    • [SPARK-9462][SQL] Initialize nondeterministic expressions in code gen fallback mode. · 27850af5
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7767 from rxin/SPARK-9462 and squashes the following commits:
      
      ef3e2d9 [Reynold Xin] Removed println
      713ac3a [Reynold Xin] More unit tests.
      bb5c334 [Reynold Xin] [SPARK-9462][SQL] Initialize nondeterministic expressions in code gen fallback mode.
      27850af5
    • [SPARK-9460] Avoid byte array allocation in StringPrefixComparator. · 07fd7d36
      Reynold Xin authored
      As of today, StringPrefixComparator converts the long values back to byte arrays in order to compare them. This patch optimizes this to compare the longs directly, rather than turning the longs into byte arrays and comparing them byte by byte (unsigned).
      
      This only works on little-endian architecture right now.
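
To illustrate why comparing the packed longs directly gives the same order as an unsigned byte-by-byte comparison, here is a small Python sketch. It is not the Spark implementation: the function names are hypothetical, and it packs the bytes big-endian explicitly, which sidesteps the endianness caveat above. Python integers are unbounded, so the unsigned comparison comes for free here.

```python
def prefix_as_long(data: bytes) -> int:
    """Pack the first 8 bytes (zero-padded) into one unsigned 64-bit integer."""
    padded = data[:8].ljust(8, b"\x00")
    # Big-endian packing puts the first byte in the most significant position.
    return int.from_bytes(padded, byteorder="big", signed=False)


def compare_prefixes(a: int, b: int) -> int:
    """Compare two packed prefixes as unsigned integers: returns -1, 0, or 1."""
    return (a > b) - (a < b)


def compare_bytewise(x: bytes, y: bytes) -> int:
    """Reference: unsigned byte-by-byte comparison of the padded prefixes."""
    px = x[:8].ljust(8, b"\x00")
    py = y[:8].ljust(8, b"\x00")
    return (px > py) - (px < py)
```

Both comparisons agree on any pair of inputs, including bytes above 0x7F, where a signed comparison would give the wrong order.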
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7765 from rxin/SPARK-9460 and squashes the following commits:
      
      e4908cc [Reynold Xin] Stricter randomized tests.
      4c8d094 [Reynold Xin] [SPARK-9460] Avoid byte array allocation in StringPrefixComparator.
      07fd7d36
    • [SPARK-9458] Avoid object allocation in prefix generation. · 9514d874
      Reynold Xin authored
      In our existing sort prefix generation code, we use expression's eval method to generate the prefix, which results in object allocation for every prefix. We can use the specialized getters available on InternalRow directly to avoid the object allocation.
      
      I also removed the FLOAT prefix, opting for converting float directly to double.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7763 from rxin/sort-prefix and squashes the following commits:
      
      5dc2f06 [Reynold Xin] [SPARK-9458] Avoid object allocation in prefix generation.
      9514d874
    • [SPARK-9440] [MLLIB] Add hyperparameters to LocalLDAModel save/load · a200e645
      Feynman Liang authored
      jkbradley MechCoder
      
      Resolves blocking issue for SPARK-6793. Please review after #7705 is merged.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7757 from feynmanliang/SPARK-9940-localSaveLoad and squashes the following commits:
      
      d0d8cf4 [Feynman Liang] Fix thisClassName
      0f30109 [Feynman Liang] Fix tests after changing LDAModel public API
      dc61981 [Feynman Liang] Add hyperparams to LocalLDAModel save/load
      a200e645
    • [SPARK-6129] [MLLIB] [DOCS] Added user guide for evaluation metrics · 2a9fe4a4
      sethah authored
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #7655 from sethah/Working_on_6129 and squashes the following commits:
      
      253db2d [sethah] removed number formatting from example code
      b769cab [sethah] rewording threshold section
      d5dad4d [sethah] adding some explanations of concepts to the eval metrics user guide
      3a61ff9 [sethah] Removing unnecessary latex commands from metrics guide
      c9dd058 [sethah] Cleaning up and formatting metrics user guide section
      6f31c21 [sethah] All example code for metrics section done
      98813fe [sethah] Most java and python example code added. Further latex formatting
      53a24fc [sethah] Adding documentations of metrics for ML algorithms to user guide
      2a9fe4a4
    • [SPARK-9016] [ML] make random forest classifiers implement classification trait · 37c2d192
      Holden Karau authored
      Implement the classification trait for RandomForestClassifiers. The plan is to use this in the future to provide thresholding for RandomForestClassifiers (as well as other classifiers that implement that trait).
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #7432 from holdenk/SPARK-9016-make-random-forest-classifiers-implement-classification-trait and squashes the following commits:
      
      bf22fa6 [Holden Karau] Add missing imports for testing suite
      e948f0d [Holden Karau] Check the prediction generation from rawprediciton
      25320c3 [Holden Karau] Don't supply numClasses when not needed, assert model classes are as expected
      1a67e04 [Holden Karau] Use old decission tree stuff instead
      673e0c3 [Holden Karau] Merge branch 'master' into SPARK-9016-make-random-forest-classifiers-implement-classification-trait
      0d15b96 [Holden Karau] FIx typo
      5eafad4 [Holden Karau] add a constructor for rootnode + num classes
      fc6156f [Holden Karau] scala style fix
      2597915 [Holden Karau] take num classes in constructor
      3ccfe4a [Holden Karau] Merge in master, make pass numClasses through randomforest for training
      222a10b [Holden Karau] Increase numtrees to 3 in the python test since before the two were equal and the argmax was selecting the last one
      16aea1c [Holden Karau] Make tests match the new models
      b454a02 [Holden Karau] Make the Tree classifiers extends the Classifier base class
      77b4114 [Holden Karau] Import vectors lib
      37c2d192
    • [SPARK-8921] [MLLIB] Add @since tags to mllib.stat · 103d8cce
      Bimal Tandel authored
      Author: Bimal Tandel <bimal@bimal-MBP.local>
      
      Closes #7730 from BimalTandel/branch_spark_8921 and squashes the following commits:
      
      3ea230a [Bimal Tandel] Spark 8921 add @since tags
      103d8cce
    • [SPARK-9448][SQL] GenerateUnsafeProjection should not share expressions across instances. · 86505962
      Reynold Xin authored
      We accidentally moved the list of expressions from the generated code instance to the class wrapper, and as a result, different threads are sharing the same set of expressions, which causes problems for expressions with mutable state.
      
      This pull request fixes that problem and also adds unit tests for all codegen classes, except GeneratedOrdering (which will never need any expressions, since sort now only accepts bound references).
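
The failure mode can be shown with a toy Python model. This is only an illustration of shared mutable state, not the generated code in Spark: `CounterExpr`, `Projection`, `SharedGenerator`, and `IsolatedGenerator` are hypothetical names, and the demo is single-threaded, whereas the original problem arose across threads.

```python
import copy


class CounterExpr:
    """Hypothetical expression with mutable state: returns 0, 1, 2, ... per call."""

    def __init__(self):
        self.count = 0

    def eval(self):
        value = self.count
        self.count += 1
        return value


class Projection:
    """Evaluates its expressions once per call to apply()."""

    def __init__(self, expressions):
        self.expressions = expressions

    def apply(self):
        return [e.eval() for e in self.expressions]


class SharedGenerator:
    """Buggy shape: every generated instance reuses the same expression objects."""

    def __init__(self, expressions):
        self.expressions = expressions

    def generate(self):
        return Projection(self.expressions)


class IsolatedGenerator:
    """Fixed shape: each generated instance gets its own copy of the expressions."""

    def __init__(self, expressions):
        self.expressions = expressions

    def generate(self):
        return Projection(copy.deepcopy(self.expressions))
```

With the shared generator, evaluating one projection advances the counter seen by the other; with the isolated generator, each projection starts from its own fresh state.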
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7759 from rxin/SPARK-9448 and squashes the following commits:
      
      c09b50f [Reynold Xin] [SPARK-9448][SQL] GenerateUnsafeProjection should not share expressions across instances.
      86505962
    • [SPARK-6793] [MLLIB] OnlineLDAOptimizer LDA perplexity · 2cc212d5
      Feynman Liang authored
      Implements `logPerplexity` in `OnlineLDAOptimizer`. Also refactors inference code into companion object to enable future reuse (e.g. `predict` method).
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7705 from feynmanliang/SPARK-6793-perplexity and squashes the following commits:
      
      6da2c99 [Feynman Liang] Remove get* from LDAModel public API
      8381da6 [Feynman Liang] Code review comments
      17f7000 [Feynman Liang] Documentation typo fixes
      2f452a4 [Feynman Liang] Remove auxillary DistributedLDAModel constructor
      a275914 [Feynman Liang] Prevent empty counts calls to variationalInference
      06d02d9 [Feynman Liang] Remove deprecated LocalLDAModel constructor
      afecb46 [Feynman Liang] Fix regression bug in sstats accumulator
      5a327a0 [Feynman Liang] Code review quick fixes
      998c03e [Feynman Liang] Fix style
      1cbb67d [Feynman Liang] Fix access modifier bug
      4362daa [Feynman Liang] Organize imports
      4f171f7 [Feynman Liang] Fix indendation
      2f049ce [Feynman Liang] Fix failing save/load tests
      7415e96 [Feynman Liang] Pick changes from big PR
      11e7c33 [Feynman Liang] Merge remote-tracking branch 'apache/master' into SPARK-6793-perplexity
      f8adc48 [Feynman Liang] Add logPerplexity, refactor variationalBound into a method
      cd521d6 [Feynman Liang] Refactor methods into companion class
      7f62a55 [Feynman Liang] --amend
      c62cb1e [Feynman Liang] Outer product for stats, revert Range slicing
      aead650 [Feynman Liang] Range slice, in-place update, reduce transposes
      2cc212d5
    • [SPARK-9411] [SQL] Make Tungsten page sizes configurable · 1b0099fc
      Josh Rosen authored
      We need to make page sizes configurable so we can reduce them in unit tests and increase them in real production workloads.  These sizes are now controlled by a new configuration, `spark.buffer.pageSize`.  The new default is 64 megabytes.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7741 from JoshRosen/SPARK-9411 and squashes the following commits:
      
      a43c4db [Josh Rosen] Fix pow
      2c0eefc [Josh Rosen] Fix MAXIMUM_PAGE_SIZE_BYTES comment + value
      bccfb51 [Josh Rosen] Lower page size to 4MB in TestHive
      ba54d4b [Josh Rosen] Make UnsafeExternalSorter's page size configurable
      0045aa2 [Josh Rosen] Make UnsafeShuffle's page size configurable
      bc734f0 [Josh Rosen] Rename configuration
      e614858 [Josh Rosen] Makes BytesToBytesMap page size configurable
      1b0099fc
    • [SPARK-9436] [GRAPHX] Pregel simplification patch · b715933f
      Alexander Ulanov authored
      Pregel code contains two consecutive joins:
      ```
      g.vertices.innerJoin(messages)(vprog)
      ...
      g = g.outerJoinVertices(newVerts)
      { (vid, old, newOpt) => newOpt.getOrElse(old) }
      ```
      This can be simplified with one join. ankurdave proposed a patch based on our discussion in the mailing list: https://www.mail-archive.com/devspark.apache.org/msg10316.html
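
As a rough illustration of why the two joins collapse into one, the following Python sketch models the vertex set and the messages as plain dictionaries keyed by vertex id. It is not GraphX code and does not reproduce the actual patch; `two_joins`, `one_join`, and `add_message` are hypothetical names.

```python
def two_joins(vertices, messages, vprog):
    # Inner join: only vertices that received a message are recomputed.
    new_verts = {
        vid: vprog(vid, attr, messages[vid])
        for vid, attr in vertices.items()
        if vid in messages
    }
    # Outer join: fall back to the old attribute when no new value exists.
    return {vid: new_verts.get(vid, old) for vid, old in vertices.items()}


def one_join(vertices, messages, vprog):
    # Single pass: apply vprog where a message exists, otherwise keep the attribute.
    return {
        vid: vprog(vid, attr, messages[vid]) if vid in messages else attr
        for vid, attr in vertices.items()
    }


def add_message(vid, attr, msg):
    """Toy vertex program: add the incoming message to the attribute."""
    return attr + msg
```

Both functions produce the same vertex map; the single-join form simply avoids materializing the intermediate `new_verts` collection.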
      
      Author: Alexander Ulanov <nashb@yandex.ru>
      
      Closes #7749 from avulanov/SPARK-9436-pregel and squashes the following commits:
      
      8568e06 [Alexander Ulanov] Pregel simplification patch
      b715933f
    • [SPARK-9430][SQL] Rename IntervalType to CalendarIntervalType. · 5340dfaf
      Reynold Xin authored
      We want to introduce a new IntervalType in 1.6 that is based only on the number of microseconds,
      so that intervals can be compared.
      
      This renames the existing IntervalType to CalendarIntervalType so we can do that in the future.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7745 from rxin/calendarintervaltype and squashes the following commits:
      
      99f64e8 [Reynold Xin] One more line ...
      13466c8 [Reynold Xin] Fixed tests.
      e20f24e [Reynold Xin] [SPARK-9430][SQL] Rename IntervalType to CalendarIntervalType.
      5340dfaf
    • [SPARK-8977] [STREAMING] Defines the RateEstimator interface, and implements the RateController · 819be46e
      Iulian Dragos authored
      Based on #7471.
      
      - [x] add a test that exercises the publish path from driver to receiver
      - [ ] remove Serializable from `RateController` and `RateEstimator`
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      Author: François Garillot <francois@garillot.net>
      
      Closes #7600 from dragos/topic/streaming-bp/rate-controller and squashes the following commits:
      
      f168c94 [Iulian Dragos] Latest review round.
      5125e60 [Iulian Dragos] Fix style.
      a2eb3b9 [Iulian Dragos] Merge remote-tracking branch 'upstream/master' into topic/streaming-bp/rate-controller
      475e346 [Iulian Dragos] Latest round of reviews.
      e9fb45e [Iulian Dragos] - Add a test for checkpointing - fixed serialization for RateController.executionContext
      715437a [Iulian Dragos] Review comments and added a `reset` call in ReceiverTrackerTest.
      e57c66b [Iulian Dragos] Added a couple of tests for the full scenario from driver to receivers, with several rate updates.
      b425d32 [Iulian Dragos] Removed DeveloperAPI, removed rateEstimator field, removed Noop rate estimator, changed logic for initialising rate estimator.
      238cfc6 [Iulian Dragos] Merge remote-tracking branch 'upstream/master' into topic/streaming-bp/rate-controller
      34a389d [Iulian Dragos] Various style changes and a first test for the rate controller.
      d32ca36 [François Garillot] [SPARK-8977][Streaming] Defines the RateEstimator interface, and implements the ReceiverRateController
      8941cf9 [Iulian Dragos] Renames and other nitpicks.
      162d9e5 [Iulian Dragos] Use Reflection for accessing truly private `executor` method and use the listener bus to know when receivers have registered (`onStart` is called before receivers have registered, leading to flaky behavior).
      210f495 [Iulian Dragos] Revert "Added a few tests that measure the receiver’s rate."
      0c51959 [Iulian Dragos] Added a few tests that measure the receiver’s rate.
      261a051 [Iulian Dragos] - removed field to hold the current rate limit in rate limiter - made rate limit a Long and default to Long.MaxValue (consequence of the above) - removed custom `waitUntil` and replaced it by `eventually`
      cd1397d [Iulian Dragos] Add a test for the propagation of a new rate limit from driver to receivers.
      6369b30 [Iulian Dragos] Merge pull request #15 from huitseeker/SPARK-8975
      d15de42 [François Garillot] [SPARK-8975][Streaming] Adds Ratelimiter unit tests w.r.t. spark.streaming.receiver.maxRate
      4721c7d [François Garillot] [SPARK-8975][Streaming] Add a mechanism to send a new rate from the driver to the block generator
      819be46e
    • [SPARK-746] [CORE] Added Avro Serialization to Kryo · 069a4c41
      Joseph Batchik authored
      Added a custom Kryo serializer for generic Avro records to reduce the network IO
      involved during a shuffle. This compresses the schema and allows for users to
      register their schemas ahead of time to further reduce traffic.
      
      Currently Kryo tries to use its default serializer for generic Records, which will include
      a lot of unneeded data in each record.
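
      The two ideas in this description (compress the schema, or skip it when it was registered ahead of time) can be sketched as below. This is a minimal illustration and not the serializer added by this commit: `SchemaHeaderCodec`, the one-byte tags, and the truncated SHA-256 fingerprint are all hypothetical choices made for the example.

```python
import hashlib
import json
import zlib


def fingerprint(schema_json):
    """Short, stable identifier for a schema (hypothetical: first 8 bytes of SHA-256)."""
    return hashlib.sha256(schema_json.encode("utf-8")).digest()[:8]


class SchemaHeaderCodec:
    """Toy header codec: registered schemas travel as a fingerprint, others compressed."""

    def __init__(self, registered_schemas=()):
        # Schemas registered ahead of time are known to both sender and receiver.
        self._by_fingerprint = {fingerprint(s): s for s in registered_schemas}

    def encode(self, schema_json):
        fp = fingerprint(schema_json)
        if fp in self._by_fingerprint:
            # Registered: send only the tag byte plus the 8-byte fingerprint.
            return b"\x01" + fp
        # Unregistered: send the tag byte plus the compressed schema text.
        return b"\x00" + zlib.compress(schema_json.encode("utf-8"))

    def decode(self, header):
        tag, payload = header[:1], header[1:]
        if tag == b"\x01":
            return self._by_fingerprint[payload]
        return zlib.decompress(payload).decode("utf-8")


schema = json.dumps({
    "type": "record",
    "name": "User",
    "fields": [{"name": "id", "type": "long"}, {"name": "email", "type": "string"}],
})

plain = SchemaHeaderCodec()
registered = SchemaHeaderCodec([schema])

assert plain.decode(plain.encode(schema)) == schema
assert registered.decode(registered.encode(schema)) == schema
print(len(registered.encode(schema)))  # 9 (tag byte + 8-byte fingerprint)
```

      With registration, the per-record header no longer depends on the size of the schema, which is where the traffic saving would come from.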
      
      Author: Joseph Batchik <joseph.batchik@cloudera.com>
      Author: Joseph Batchik <josephbatchik@gmail.com>
      
      Closes #7004 from JDrit/Avro_serialization and squashes the following commits:
      
      8158d51 [Joseph Batchik] updated per feedback
      c0cf329 [Joseph Batchik] implemented @squito suggestion for SparkEnv
      dd71efe [Joseph Batchik] fixed bug with serializing
      1183a48 [Joseph Batchik] updated codec settings
      fa9298b [Joseph Batchik] forgot a couple of fixes
      c5fe794 [Joseph Batchik] implemented @squito suggestion
      0f5471a [Joseph Batchik] implemented @squito suggestion to use a codec that is already in spark
      6d1925c [Joseph Batchik] fixed to changes suggested by @squito
      d421bf5 [Joseph Batchik] updated pom to removed versions
      ab46d10 [Joseph Batchik] Changed Avro dependency to be similar to parent
      f4ae251 [Joseph Batchik] fixed serialization error in that SparkConf cannot be serialized
      2b545cc [Joseph Batchik] started working on fixes for pr
      97fba62 [Joseph Batchik] Added a custom Kryo serializer for generic Avro records to reduce the network IO involved during a shuffle. This compresses the schema and allows for users to register their schemas ahead of time to further reduce traffic.
      069a4c41
    • Reynold Xin's avatar
      [SPARK-9127][SQL] Rand/Randn codegen fails with long seed. · 97906944
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7747 from rxin/SPARK-9127 and squashes the following commits:
      
      e851418 [Reynold Xin] [SPARK-9127][SQL] Rand/Randn codegen fails with long seed.
      97906944
    • Wenchen Fan's avatar
      [SPARK-9251][SQL] do not order by expressions which still need evaluation · 708794e8
      Wenchen Fan authored
      Per an offline discussion with rxin, it's weird to be computing expressions while sorting; we should only order by bound references during execution.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7593 from cloud-fan/sort and squashes the following commits:
      
      7b1bef7 [Wenchen Fan] add test
      daf206d [Wenchen Fan] add more comments
      289bee0 [Wenchen Fan] do not order by expressions which still need evaluation
      708794e8
    • Davies Liu's avatar
      [SPARK-9281] [SQL] use decimal or double when parsing SQL · 15667a0a
      Davies Liu authored
      Right now, we use double to parse all floating-point numbers in SQL. When such a number is used in an expression together with a DecimalType, it turns the decimal into a double as well. Using double also loses some precision.
      
      This PR changes the parser to read a floating-point number as either decimal or double, depending on whether it uses scientific notation; see https://msdn.microsoft.com/en-us/library/ms179899.aspx
      
      This is a breaking change; should we document it somewhere?
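
      The rule described above can be illustrated with a small sketch. It is not Spark's `SqlParser`: `parse_numeric_literal` is a hypothetical helper, and it assumes the input is already a well-formed numeric literal.

```python
from decimal import Decimal


def parse_numeric_literal(text):
    """Parse a SQL-style numeric literal (hypothetical helper, not Spark's parser)."""
    # Literals written in scientific notation are treated as approximate (double).
    if "e" in text.lower():
        return float(text)
    # Plain fixed-point literals keep their exact decimal value.
    return Decimal(text)


exact = parse_numeric_literal("0.1") + parse_numeric_literal("0.2")
approx = float("0.1") + float("0.2")

print(exact)   # 0.3
print(approx)  # 0.30000000000000004
print(type(parse_numeric_literal("1.5e3")).__name__)  # float
```

      The second result shows the kind of precision loss that parsing everything as double introduces.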
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7642 from davies/parse_decimal and squashes the following commits:
      
      1f576d9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
      5e142b6 [Davies Liu] fix scala style
      eca99de [Davies Liu] fix tests
      2afe702 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
      f4a320b [Davies Liu] Update SqlParser.scala
      1c48e34 [Davies Liu] use decimal or double when parsing SQL
      15667a0a
    • Yijie Shen's avatar
      [SPARK-9398] [SQL] Datetime cleanup · 6309b934
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9398
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7725 from yjshen/date_null_check and squashes the following commits:
      
      b4eade1 [Yijie Shen] inline daysToMonthEnd
      d09acc1 [Yijie Shen] implement getLastDayOfMonth to avoid repeated evaluation
      d857ec3 [Yijie Shen] add null check in DateExpressionSuite
      6309b934
  3. Jul 28, 2015
    • Josh Rosen's avatar
      [SPARK-9419] ShuffleMemoryManager and MemoryStore should track memory on a... · ea49705b
      Josh Rosen authored
      [SPARK-9419] ShuffleMemoryManager and MemoryStore should track memory on a per-task, not per-thread, basis
      
      Spark's ShuffleMemoryManager and MemoryStore track memory on a per-thread basis, which causes problems in the handful of cases where we have tasks that use multiple threads. In PythonRDD, RRDD, ScriptTransformation, and PipedRDD we consume the input iterator in a separate thread in order to write it to an external process. As a result, these RDDs' input iterators are consumed in a different thread than the thread that created them, which can cause problems in our memory allocation tracking. For example, if allocations are performed in one thread but deallocations are performed in a separate thread, then memory may be leaked or we may get errors complaining that more memory was freed than was allocated.
      
      I think that the right way to fix this is to change our accounting to be performed on a per-task instead of per-thread basis.  Note that the current per-thread tracking has caused problems in the past; SPARK-3731 (#2668) fixes a memory leak in PythonRDD that was caused by this issue (that fix is no longer necessary as of this patch).
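
      A toy model of the accounting problem, assuming a simple counter table: `MemoryTracker` and its methods are hypothetical and only mimic the bookkeeping, not Spark's ShuffleMemoryManager or MemoryStore. Memory is acquired in the task's main thread and released from a writer thread, as in the RDDs listed above.

```python
import threading
from collections import defaultdict


class MemoryTracker:
    """Toy accounting table; `key_fn` decides which owner an allocation is charged to."""

    def __init__(self, key_fn):
        self._key_fn = key_fn
        self._used = defaultdict(int)
        self._lock = threading.Lock()

    def acquire(self, task_id, nbytes):
        with self._lock:
            self._used[self._key_fn(task_id)] += nbytes

    def release(self, task_id, nbytes):
        with self._lock:
            self._used[self._key_fn(task_id)] -= nbytes

    def unbalanced(self):
        """Owners whose books do not sum to zero."""
        with self._lock:
            return {k: v for k, v in self._used.items() if v != 0}


# Per-thread accounting ignores the task id and keys on the calling thread.
per_thread = MemoryTracker(lambda task_id: threading.get_ident())
# Per-task accounting keys on the task id, whichever thread makes the call.
per_task = MemoryTracker(lambda task_id: task_id)

TASK_ID = 42  # hypothetical task attempt id
for tracker in (per_thread, per_task):
    tracker.acquire(TASK_ID, 1024)  # allocated in the task's main thread


def writer():
    # Released from a separate writer thread that belongs to the same task.
    for tracker in (per_thread, per_task):
        tracker.release(TASK_ID, 1024)


t = threading.Thread(target=writer)
t.start()
t.join()

print(len(per_thread.unbalanced()))  # 2: one thread looks leaked, one over-freed
print(per_task.unbalanced())         # {}
```

      Keying on the task id keeps the books balanced even though the release happens on a different thread.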
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7734 from JoshRosen/memory-tracking-fixes and squashes the following commits:
      
      b4b1702 [Josh Rosen] Propagate TaskContext to writer threads.
      57c9b4e [Josh Rosen] Merge remote-tracking branch 'origin/master' into memory-tracking-fixes
      ed25d3b [Josh Rosen] Address minor PR review comments
      44f6497 [Josh Rosen] Fix long line.
      7b0f04b [Josh Rosen] Fix ShuffleMemoryManagerSuite
      f57f3f2 [Josh Rosen] More thread -> task changes
      fa78ee8 [Josh Rosen] Move Executor's cleanup into Task so that TaskContext is defined when cleanup is performed
      5e2f01e [Josh Rosen] Fix capitalization
      1b0083b [Josh Rosen] Roll back fix in PySpark, which is no longer necessary
      2e1e0f8 [Josh Rosen] Use TaskAttemptIds to track shuffle memory
      c9e8e54 [Josh Rosen] Use TaskAttemptIds to track unroll memory
      ea49705b
    • Wenchen Fan's avatar
      [SPARK-8608][SPARK-8609][SPARK-9083][SQL] reset mutable states of... · 429b2f0d
      Wenchen Fan authored
      [SPARK-8608][SPARK-8609][SPARK-9083][SQL] reset mutable states of nondeterministic expression before evaluation and fix PullOutNondeterministic
      
      We do local projection for LocalRelation, and thus reuse the same Expression object across multiple evaluations. We should reset the mutable state of an Expression before evaluating it.
      
      Fix `PullOutNondeterministic` rule to make it work for `Sort`.
      
      Also took the chance to clean up the DataFrame test suite.
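
      Why the reset matters can be shown with a small sketch. `Rand`, `initialize`, and `project` below are hypothetical stand-ins rather than Catalyst classes; the point is only that a seeded expression object reused across evaluations must have its state reset to return the same values.

```python
import random


class Rand:
    """Toy nondeterministic expression holding mutable RNG state."""

    def __init__(self, seed):
        self._seed = seed
        self._rng = random.Random(seed)

    def initialize(self):
        # Reset the mutable state so evaluation restarts from the seed.
        self._rng = random.Random(self._seed)

    def eval(self):
        return self._rng.random()


def project(expr, n_rows, reset):
    """Evaluate `expr` once per row, optionally resetting its state first."""
    if reset:
        expr.initialize()
    return [expr.eval() for _ in range(n_rows)]


expr = Rand(seed=7)

first = project(expr, 3, reset=True)
second = project(expr, 3, reset=True)
stale = project(expr, 3, reset=False)  # continues the previous random stream

print(first == second)  # True
print(stale == first)   # False
```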
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7674 from cloud-fan/show and squashes the following commits:
      
      888934f [Wenchen Fan] fix sort
      c0e93e8 [Wenchen Fan] local DataFrame with random columns should return same value when call `show`
      429b2f0d
    • Yin Huai's avatar
      [SPARK-9422] [SQL] Remove the placeholder attributes used in the aggregation buffers · 3744b7fd
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-9422
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #7737 from yhuai/removePlaceHolder and squashes the following commits:
      
      ec29b44 [Yin Huai]  Remove placeholder attributes.
      3744b7fd
    • Josh Rosen's avatar
      [SPARK-9421] Fix null-handling bugs in UnsafeRow.getDouble, getFloat(), and get(ordinal, dataType) · e78ec1a8
      Josh Rosen authored
      UnsafeRow.getDouble and getFloat() return NaN when called on columns that are null, which is inconsistent with the behavior of other row classes (which is to return 0.0).
      
      In addition, the generic get(ordinal, dataType) method should always return null for a null literal, but currently it handles nulls by calling the type-specific accessors.
      
      This patch addresses both of these issues and adds a regression test.
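
      The intended contract can be sketched as follows. `ToyRow` is a hypothetical stand-in and not UnsafeRow; it only mirrors the behaviour described above: the primitive accessor returns 0.0 for a null column, and the generic accessor returns null without going through the type-specific accessor.

```python
class ToyRow:
    """Toy row where None marks a null column (hypothetical, not UnsafeRow)."""

    def __init__(self, values):
        self._values = list(values)

    def is_null_at(self, ordinal):
        return self._values[ordinal] is None

    def get_double(self, ordinal):
        # Primitive accessor: a null column yields 0.0 rather than NaN.
        if self.is_null_at(ordinal):
            return 0.0
        return float(self._values[ordinal])

    def get(self, ordinal):
        # Generic accessor: check for null first instead of delegating
        # to the type-specific accessor.
        if self.is_null_at(ordinal):
            return None
        return self._values[ordinal]


row = ToyRow([1.5, None])

print(row.get_double(0))  # 1.5
print(row.get_double(1))  # 0.0
print(row.get(1))         # None
```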
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7736 from JoshRosen/unsafe-row-null-fixes and squashes the following commits:
      
      c8eb2ee [Josh Rosen] Fix test in UnsafeRowConverterSuite
      6214682 [Josh Rosen] Fixes to null handling in UnsafeRow
      e78ec1a8